{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "CnMJYir34Ntb"
   },
   "source": [
    "# 🔍 Hybrid Search with LanceDB Cloud\n",
    "\n",
    "🚀 **_If you haven’t signed up for LanceDB Cloud yet, click [here](https://cloud.lancedb.com) to get started!_**\n",
    "\n",
    "This notebook demonstrates how to implement **hybrid search** using LanceDB Cloud, combining the power of vector embeddings and full-text search with custom business logic. Designed for real-world search applications, this example leverages:\n",
    "\n",
    "- **OpenAI Embeddings** for semantic understanding\n",
    "- **LanceDB Cloud** for managed vector storage\n",
    "- **BeIR Benchmark Dataset** for scientific document retrieval evaluation\n",
    "\n",
    "## 🚀 Key Features\n",
    "| Component | Implementation |\n",
    "|-----------|-----------------|\n",
    "| **Hybrid Search** | Vector + FTS with RRF Reranking |\n",
    "| **Custom Filters** | Domain-specific result filtering |\n",
    "| **Managed Infrastructure** | LanceDB Cloud |\n",
    "| **Scientific Focus** | SCIDOCS Dataset |\n",
    "\n"
   ]
  },
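  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As background for the **RRF Reranking** entry in the table above: Reciprocal Rank Fusion merges several ranked result lists by giving each document a score of `1 / (k + rank)` in every list where it appears and summing those scores. The cell below is a minimal, standalone sketch of that idea and is not LanceDB's implementation. The function name `rrf_fuse`, the sample document IDs, and the constant `k=60` are assumptions chosen for illustration."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from collections import defaultdict\n",
    "\n",
    "\n",
    "def rrf_fuse(rankings, k=60):\n",
    "    \"\"\"Fuse ranked lists of document IDs with Reciprocal Rank Fusion (illustrative helper).\"\"\"\n",
    "    scores = defaultdict(float)\n",
    "    for ranking in rankings:\n",
    "        # Ranks start at 1; a document missing from a list contributes nothing for that list\n",
    "        for rank, doc_id in enumerate(ranking, start=1):\n",
    "            scores[doc_id] += 1.0 / (k + rank)\n",
    "    # Highest fused score first\n",
    "    return sorted(scores, key=scores.get, reverse=True)\n",
    "\n",
    "\n",
    "# Hypothetical result lists from a full-text search and a vector search\n",
    "fts_ranking = [\"doc_a\", \"doc_b\", \"doc_c\"]\n",
    "vector_ranking = [\"doc_c\", \"doc_a\", \"doc_d\"]\n",
    "print(rrf_fuse([fts_ranking, vector_ranking]))"
   ]
  },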
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "9N9tRap0YYp4"
   },
   "source": [
    "## Step 1: Install Required Libraries\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "AZHJyHexyHY3"
   },
   "outputs": [],
   "source": [
    "!pip install -U lancedb transformers datasets FlagEmbedding unstructured langchain -qq"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "Qm0a39PCPYDs"
   },
   "source": [
    "## Step 2: Obtain the API key from the [dashboard](https://cloud.lancedb.com) and Connect to LanceDB Cloud\n",
    "\n",
    "*  Get the `db uri`\n",
    "\n",
    "`db uri` starts with `db://`, which can be obtained from the project page on the dashboard. In the following example, `db uri` is `db://test-sfifxz`.\n",
    "\n",
    "![db-uri.png]()\n",
    "\n",
    "*  Get the `API Key`\n",
    "Obtain a LanceDB Cloud API key by clicking on the `GENERATE API KEY` from the `table` page.\n",
    "\n",
    "💡 Copy the code block for connecting to LanceDB Cloud that is shown at the last step of API key generation.\n",
    "![image.png]()\n",
    "\n",
    "*  Connect to LanceDB Cloud\n",
    "\n",
    "Copy and paste the `db uri` and the `api key` from the previous steps, or directly paste the code block for LanceDB Cloud connection."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "6553603f"
   },
   "outputs": [],
   "source": [
    "uri = \"db://your-db-uri\"  # @param {type:\"string\"}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "36ef9c45"
   },
   "outputs": [],
   "source": [
    "api_key = \"sk_...\"  # @param {type:\"string\"}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "kI_FMG_O79xy"
   },
   "source": [
    "paste your OPEN_AI_KEY"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "vIVUoGCi8MD2"
   },
   "outputs": [],
   "source": [
    "openai_api_key = \"your-openai-api-key\"  # @param {type:\"string\"}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "0QQL4lm8lTzg"
   },
   "source": [
    "## Step 3: Import libraries\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "vP6d6JUShgqo"
   },
   "outputs": [],
   "source": [
    "import os\n",
    "import lancedb\n",
    "import re\n",
    "import pandas as pd\n",
    "import random\n",
    "\n",
    "from datasets import load_dataset\n",
    "\n",
    "import torch\n",
    "import gc\n",
    "\n",
    "import lance\n",
    "\n",
    "\n",
    "import os\n",
    "\n",
    "import lancedb\n",
    "import openai\n",
    "from lancedb.embeddings import get_registry\n",
    "from lancedb.pydantic import LanceModel, Vector\n",
    "\n",
    "\n",
    "os.environ[\"OPENAI_API_KEY\"] = openai_api_key\n",
    "\n",
    "\n",
    "embeddings = get_registry().get(\"openai\").create()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "8eKRYd2F7v5n"
   },
   "source": [
    "## Step 4: Load `Chunks` of data from [BeIR Dataset](https://huggingface.co/datasets/BeIR/scidocs)\n",
    "\n",
    "Note: This is a dataset built specially for retrieval tasks to see how good your search is working"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "l0ezDr7suAf_"
   },
   "outputs": [],
   "source": [
    "# Load queries and corpus\n",
    "queries = load_dataset(\"BeIR/scidocs\", \"queries\")[\"queries\"].to_pandas()\n",
    "full_docs = (\n",
    "    load_dataset(\"BeIR/scidocs\", \"corpus\")[\"corpus\"].to_pandas().dropna(subset=[\"text\"])\n",
    ")\n",
    "docs = full_docs.head(64).copy()  # Explicitly create a new copy\n",
    "docs.loc[:, \"num_words\"] = docs[\"text\"].apply(lambda x: len(x.split()))\n",
    "docs.sample(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "HJf8xZmX8VJC"
   },
   "source": [
    "## Step 5: Connect to `LanceDB Cloud` and store embeddings"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "5aljyqpUiViE"
   },
   "outputs": [],
   "source": [
    "class Documents(LanceModel):\n",
    "    vector: Vector(embeddings.ndims()) = embeddings.VectorField()\n",
    "    text: str = embeddings.SourceField()\n",
    "    title: str\n",
    "    num_words: int\n",
    "\n",
    "\n",
    "api_key = api_key\n",
    "uri = uri\n",
    "\n",
    "db = lancedb.connect(uri=uri, api_key=api_key, region=\"us-east-1\")\n",
    "\n",
    "table_name = \"documents\"\n",
    "table = db.create_table(\n",
    "    table_name, schema=Documents, mode=\"overwrite\"\n",
    ")  # create an emtpy table\n",
    "data = docs.apply(\n",
    "    lambda row: {\n",
    "        \"title\": row[\"title\"],\n",
    "        \"text\": row[\"text\"],\n",
    "        \"num_words\": row[\"num_words\"],\n",
    "    },\n",
    "    axis=1,\n",
    ").values.tolist()\n",
    "table.add(data)  # ingest docs with auto-vectorization"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "l-LTnvqy_1jF"
   },
   "source": [
    "## Step 6: Build a Full Text Search (FTS) index\n",
    "\n",
    "ℹ️ Note that a FTS index is required for performing a hybrid search\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "IgJi1xhxAGnD"
   },
   "outputs": [],
   "source": [
    "table.create_fts_index(\"text\")  # Create a fts index before the hybrid search"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "8oPffVhtbfRt"
   },
   "source": [
    "⚠️ WARNING: `create_fts_index` is asynchonous so it returns when indexing is in progress. We provide the `list_indices` and `index_stats` APIs to check index status. The index name is formed by appending “_idx” to the column name. Note that `list_indices` will not return any information until the index has fully ingested and indexed all available data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "w8_SIhNrbPYw",
    "outputId": "b37da26c-a59b-485c-d219-b84a0b0d5bd6"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "⏳ Waiting for text_idx to be ready...\n",
      "⏳ Waiting for text_idx to be ready...\n",
      "⏳ Waiting for text_idx to be ready...\n",
      "⏳ Waiting for text_idx to be ready...\n",
      "✅ text_idx is ready!\n",
      "IndexStatistics(num_indexed_rows=64, num_unindexed_rows=0, index_type='FTS', distance_type=None, num_indices=None)\n"
     ]
    }
   ],
   "source": [
    "import time\n",
    "\n",
    "\n",
    "def wait_for_index(table, index_name):\n",
    "    POLL_INTERVAL = 10\n",
    "    while True:\n",
    "        indices = table.list_indices()\n",
    "\n",
    "        if indices and any(index.name == index_name for index in indices):\n",
    "            break\n",
    "        print(f\"⏳ Waiting for {index_name} to be ready...\")\n",
    "        time.sleep(POLL_INTERVAL)\n",
    "\n",
    "    print(f\"✅ {index_name} is ready!\")\n",
    "\n",
    "\n",
    "index_name = \"text_idx\"\n",
    "wait_for_index(table, index_name)\n",
    "print(table.index_stats(index_name))\n",
    "# IndexStatistics(num_indexed_rows=64, num_unindexed_rows=0, index_type='FTS', distance_type=None, num_indices=None)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "PCufm9Xr8eWp"
   },
   "source": [
    "## Step 7: Search from a random Text"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "AH2hOtrt485i"
   },
   "source": [
    "Let's first perform a Full-Text search"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 206
    },
    "id": "964Z2sZA247g",
    "outputId": "55ec783e-24f6-4468-afba-f394cfcd9372"
   },
   "outputs": [
    {
     "data": {
      "application/vnd.google.colaboratory.intrinsic+json": {
       "summary": "{\n  \"name\": \")\",\n  \"rows\": 5,\n  \"fields\": [\n    {\n      \"column\": \"vector\",\n      \"properties\": {\n        \"dtype\": \"object\",\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"text\",\n      \"properties\": {\n        \"dtype\": \"string\",\n        \"num_unique_values\": 5,\n        \"samples\": [\n          \"Hierarchical Pitman-Yor Process priors are compelling methods for learning language models, outperforming point-estimate based methods. However, these models remain unpopular due to computational and statistical inference issues, such as memory and time usage, as well as poor mixing of sampler. In this work we propose a novel framework which represents the HPYP model compactly using compressed suffix trees. Then, we develop an efficient approximate inference scheme in this framework that has a much lower memory footprint compared to full HPYP and is fast in the inference time. The experimental results illustrate that our model can be built on significantly larger datasets compared to previous HPYP models, while being several orders of magnitudes smaller, fast for training and inference, and outperforming the perplexity of the state-of-the-art Modified Kneser-Ney countbased LM smoothing by up to 15%.\",\n          \"Software organizations have typically de-emphasized the importance of software testing. In this paper, the results of a regional survey of software testing and software quality assurance techniques are described. Researchers conducted the study during the summer and fall of 2002 by surveying software organizations in the Province of Alberta. Results indicate that Alberta-based organizations tend to test less than their counterparts in the United States. The results also indicate that Alberta software organizations tend to train fewer personnel on testing-related topics. 
This practice has the potential for a two-fold impact: first, the ability to detect trends that lead to reduced quality and to identify the root causes of reductions in product quality may suffer from the lack of testing. This consequence is serious enough to warrant consideration, since overall quality may suffer from the reduced ability to detect and eliminate process or product defects. Second, the organization may have a more difficult time adopting methodologies such as extreme programming. This is significant because other industry studies have concluded that many software organizations have tried or will in the next few years try some form of agile method. Newer approaches to software development like extreme programming increase the extent to which teams rely on testing skills. Organizations should consider their testing skill level as a key indication of their readiness for adopting software development techniques such as test-driven development, extreme programming, agile modelling, or other agile methods.\",\n          \"We introduce a long short-term memory recurrent neural network (LSTM-RNN) approach for real-time facial animation, which automatically estimates head rotation and facial action unit activations of a speaker from just her speech. Specifically, the time-varying contextual non-linear mapping between audio stream and visual facial movements is realized by training a LSTM neural network on a large audio-visual data corpus. In this work, we extract a set of acoustic features from input audio, including Mel-scaled spectrogram, Mel frequency cepstral coefficients and chromagram that can effectively represent both contextual progression and emotional intensity of the speech. Output facial movements are characterized by 3D rotation and blending expression weights of a blendshape model, which can be used directly for animation. 
Thus, even though our model does not explicitly predict the affective states of the target speaker, her emotional manifestation is recreated via expression weights of the face model. Experiments on an evaluation dataset of different speakers across a wide range of affective states demonstrate promising results of our approach in real-time speech-driven facial animation.\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"title\",\n      \"properties\": {\n        \"dtype\": \"string\",\n        \"num_unique_values\": 5,\n        \"samples\": [\n          \"Compressed Nonparametric Language Modelling\",\n          \"A survey of software testing practices in alberta\",\n          \"Speech-driven 3 D Facial Animation with Implicit Emotional Awareness : A Deep Learning Approach\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"num_words\",\n      \"properties\": {\n        \"dtype\": \"number\",\n        \"std\": 51,\n        \"min\": 112,\n        \"max\": 236,\n        \"num_unique_values\": 5,\n        \"samples\": [\n          134,\n          236,\n          171\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"_score\",\n      \"properties\": {\n        \"dtype\": \"float32\",\n        \"num_unique_values\": 5,\n        \"samples\": [\n          6.727994918823242,\n          5.96092414855957,\n          5.996895790100098\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    }\n  ]\n}",
       "type": "dataframe"
      },
      "text/html": [
       "\n",
       "  <div id=\"df-6790ee52-b0ab-488b-b194-ba0470440cb1\" class=\"colab-df-container\">\n",
       "    <div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>vector</th>\n",
       "      <th>text</th>\n",
       "      <th>title</th>\n",
       "      <th>num_words</th>\n",
       "      <th>_score</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>[-0.012242989, -0.003041715, 0.022755764, -0.0...</td>\n",
       "      <td>Sociological and technical difficulties, such ...</td>\n",
       "      <td>Hipikat: a project memory for software develop...</td>\n",
       "      <td>210</td>\n",
       "      <td>9.177268</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>[-0.013615985, 0.024568425, 0.0012754148, -0.0...</td>\n",
       "      <td>Hierarchical Pitman-Yor Process priors are com...</td>\n",
       "      <td>Compressed Nonparametric Language Modelling</td>\n",
       "      <td>134</td>\n",
       "      <td>6.727995</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>[-0.03583555, -0.00017553041, 0.009382936, -0....</td>\n",
       "      <td>We introduce a long short-term memory recurren...</td>\n",
       "      <td>Speech-driven 3 D Facial Animation with Implic...</td>\n",
       "      <td>171</td>\n",
       "      <td>5.996896</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>[-0.03091255, -0.0026309614, 0.010364091, -0.0...</td>\n",
       "      <td>This report summarizes the objectives and eval...</td>\n",
       "      <td>SemEval-2015 Task 11: Sentiment Analysis of Fi...</td>\n",
       "      <td>112</td>\n",
       "      <td>5.986894</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>[0.0013170651, -0.022660466, 0.014392637, -0.0...</td>\n",
       "      <td>Software organizations have typically de-empha...</td>\n",
       "      <td>A survey of software testing practices in alberta</td>\n",
       "      <td>236</td>\n",
       "      <td>5.960924</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>\n",
       "    <div class=\"colab-df-buttons\">\n",
       "\n",
       "  <div class=\"colab-df-container\">\n",
       "    <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-6790ee52-b0ab-488b-b194-ba0470440cb1')\"\n",
       "            title=\"Convert this dataframe to an interactive table.\"\n",
       "            style=\"display:none;\">\n",
       "\n",
       "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\" viewBox=\"0 -960 960 960\">\n",
       "    <path d=\"M120-120v-720h720v720H120Zm60-500h600v-160H180v160Zm220 220h160v-160H400v160Zm0 220h160v-160H400v160ZM180-400h160v-160H180v160Zm440 0h160v-160H620v160ZM180-180h160v-160H180v160Zm440 0h160v-160H620v160Z\"/>\n",
       "  </svg>\n",
       "    </button>\n",
       "\n",
       "  <style>\n",
       "    .colab-df-container {\n",
       "      display:flex;\n",
       "      gap: 12px;\n",
       "    }\n",
       "\n",
       "    .colab-df-convert {\n",
       "      background-color: #E8F0FE;\n",
       "      border: none;\n",
       "      border-radius: 50%;\n",
       "      cursor: pointer;\n",
       "      display: none;\n",
       "      fill: #1967D2;\n",
       "      height: 32px;\n",
       "      padding: 0 0 0 0;\n",
       "      width: 32px;\n",
       "    }\n",
       "\n",
       "    .colab-df-convert:hover {\n",
       "      background-color: #E2EBFA;\n",
       "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
       "      fill: #174EA6;\n",
       "    }\n",
       "\n",
       "    .colab-df-buttons div {\n",
       "      margin-bottom: 4px;\n",
       "    }\n",
       "\n",
       "    [theme=dark] .colab-df-convert {\n",
       "      background-color: #3B4455;\n",
       "      fill: #D2E3FC;\n",
       "    }\n",
       "\n",
       "    [theme=dark] .colab-df-convert:hover {\n",
       "      background-color: #434B5C;\n",
       "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
       "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
       "      fill: #FFFFFF;\n",
       "    }\n",
       "  </style>\n",
       "\n",
       "    <script>\n",
       "      const buttonEl =\n",
       "        document.querySelector('#df-6790ee52-b0ab-488b-b194-ba0470440cb1 button.colab-df-convert');\n",
       "      buttonEl.style.display =\n",
       "        google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
       "\n",
       "      async function convertToInteractive(key) {\n",
       "        const element = document.querySelector('#df-6790ee52-b0ab-488b-b194-ba0470440cb1');\n",
       "        const dataTable =\n",
       "          await google.colab.kernel.invokeFunction('convertToInteractive',\n",
       "                                                    [key], {});\n",
       "        if (!dataTable) return;\n",
       "\n",
       "        const docLinkHtml = 'Like what you see? Visit the ' +\n",
       "          '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
       "          + ' to learn more about interactive tables.';\n",
       "        element.innerHTML = '';\n",
       "        dataTable['output_type'] = 'display_data';\n",
       "        await google.colab.output.renderOutput(dataTable, element);\n",
       "        const docLink = document.createElement('div');\n",
       "        docLink.innerHTML = docLinkHtml;\n",
       "        element.appendChild(docLink);\n",
       "      }\n",
       "    </script>\n",
       "  </div>\n",
       "\n",
       "\n",
       "<div id=\"df-b61d660c-f42a-4a46-b0de-3e2f22b96f52\">\n",
       "  <button class=\"colab-df-quickchart\" onclick=\"quickchart('df-b61d660c-f42a-4a46-b0de-3e2f22b96f52')\"\n",
       "            title=\"Suggest charts\"\n",
       "            style=\"display:none;\">\n",
       "\n",
       "<svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
       "     width=\"24px\">\n",
       "    <g>\n",
       "        <path d=\"M19 3H5c-1.1 0-2 .9-2 2v14c0 1.1.9 2 2 2h14c1.1 0 2-.9 2-2V5c0-1.1-.9-2-2-2zM9 17H7v-7h2v7zm4 0h-2V7h2v10zm4 0h-2v-4h2v4z\"/>\n",
       "    </g>\n",
       "</svg>\n",
       "  </button>\n",
       "\n",
       "<style>\n",
       "  .colab-df-quickchart {\n",
       "      --bg-color: #E8F0FE;\n",
       "      --fill-color: #1967D2;\n",
       "      --hover-bg-color: #E2EBFA;\n",
       "      --hover-fill-color: #174EA6;\n",
       "      --disabled-fill-color: #AAA;\n",
       "      --disabled-bg-color: #DDD;\n",
       "  }\n",
       "\n",
       "  [theme=dark] .colab-df-quickchart {\n",
       "      --bg-color: #3B4455;\n",
       "      --fill-color: #D2E3FC;\n",
       "      --hover-bg-color: #434B5C;\n",
       "      --hover-fill-color: #FFFFFF;\n",
       "      --disabled-bg-color: #3B4455;\n",
       "      --disabled-fill-color: #666;\n",
       "  }\n",
       "\n",
       "  .colab-df-quickchart {\n",
       "    background-color: var(--bg-color);\n",
       "    border: none;\n",
       "    border-radius: 50%;\n",
       "    cursor: pointer;\n",
       "    display: none;\n",
       "    fill: var(--fill-color);\n",
       "    height: 32px;\n",
       "    padding: 0;\n",
       "    width: 32px;\n",
       "  }\n",
       "\n",
       "  .colab-df-quickchart:hover {\n",
       "    background-color: var(--hover-bg-color);\n",
       "    box-shadow: 0 1px 2px rgba(60, 64, 67, 0.3), 0 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
       "    fill: var(--button-hover-fill-color);\n",
       "  }\n",
       "\n",
       "  .colab-df-quickchart-complete:disabled,\n",
       "  .colab-df-quickchart-complete:disabled:hover {\n",
       "    background-color: var(--disabled-bg-color);\n",
       "    fill: var(--disabled-fill-color);\n",
       "    box-shadow: none;\n",
       "  }\n",
       "\n",
       "  .colab-df-spinner {\n",
       "    border: 2px solid var(--fill-color);\n",
       "    border-color: transparent;\n",
       "    border-bottom-color: var(--fill-color);\n",
       "    animation:\n",
       "      spin 1s steps(1) infinite;\n",
       "  }\n",
       "\n",
       "  @keyframes spin {\n",
       "    0% {\n",
       "      border-color: transparent;\n",
       "      border-bottom-color: var(--fill-color);\n",
       "      border-left-color: var(--fill-color);\n",
       "    }\n",
       "    20% {\n",
       "      border-color: transparent;\n",
       "      border-left-color: var(--fill-color);\n",
       "      border-top-color: var(--fill-color);\n",
       "    }\n",
       "    30% {\n",
       "      border-color: transparent;\n",
       "      border-left-color: var(--fill-color);\n",
       "      border-top-color: var(--fill-color);\n",
       "      border-right-color: var(--fill-color);\n",
       "    }\n",
       "    40% {\n",
       "      border-color: transparent;\n",
       "      border-right-color: var(--fill-color);\n",
       "      border-top-color: var(--fill-color);\n",
       "    }\n",
       "    60% {\n",
       "      border-color: transparent;\n",
       "      border-right-color: var(--fill-color);\n",
       "    }\n",
       "    80% {\n",
       "      border-color: transparent;\n",
       "      border-right-color: var(--fill-color);\n",
       "      border-bottom-color: var(--fill-color);\n",
       "    }\n",
       "    90% {\n",
       "      border-color: transparent;\n",
       "      border-bottom-color: var(--fill-color);\n",
       "    }\n",
       "  }\n",
       "</style>\n",
       "\n",
       "  <script>\n",
       "    async function quickchart(key) {\n",
       "      const quickchartButtonEl =\n",
       "        document.querySelector('#' + key + ' button');\n",
       "      quickchartButtonEl.disabled = true;  // To prevent multiple clicks.\n",
       "      quickchartButtonEl.classList.add('colab-df-spinner');\n",
       "      try {\n",
       "        const charts = await google.colab.kernel.invokeFunction(\n",
       "            'suggestCharts', [key], {});\n",
       "      } catch (error) {\n",
       "        console.error('Error during call to suggestCharts:', error);\n",
       "      }\n",
       "      quickchartButtonEl.classList.remove('colab-df-spinner');\n",
       "      quickchartButtonEl.classList.add('colab-df-quickchart-complete');\n",
       "    }\n",
       "    (() => {\n",
       "      let quickchartButtonEl =\n",
       "        document.querySelector('#df-b61d660c-f42a-4a46-b0de-3e2f22b96f52 button');\n",
       "      quickchartButtonEl.style.display =\n",
       "        google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
       "    })();\n",
       "  </script>\n",
       "</div>\n",
       "\n",
       "    </div>\n",
       "  </div>\n"
      ],
      "text/plain": [
       "                                              vector  \\\n",
       "0  [-0.012242989, -0.003041715, 0.022755764, -0.0...   \n",
       "1  [-0.013615985, 0.024568425, 0.0012754148, -0.0...   \n",
       "2  [-0.03583555, -0.00017553041, 0.009382936, -0....   \n",
       "3  [-0.03091255, -0.0026309614, 0.010364091, -0.0...   \n",
       "4  [0.0013170651, -0.022660466, 0.014392637, -0.0...   \n",
       "\n",
       "                                                text  \\\n",
       "0  Sociological and technical difficulties, such ...   \n",
       "1  Hierarchical Pitman-Yor Process priors are com...   \n",
       "2  We introduce a long short-term memory recurren...   \n",
       "3  This report summarizes the objectives and eval...   \n",
       "4  Software organizations have typically de-empha...   \n",
       "\n",
       "                                               title  num_words    _score  \n",
       "0  Hipikat: a project memory for software develop...        210  9.177268  \n",
       "1        Compressed Nonparametric Language Modelling        134  6.727995  \n",
       "2  Speech-driven 3 D Facial Animation with Implic...        171  5.996896  \n",
       "3  SemEval-2015 Task 11: Sentiment Analysis of Fi...        112  5.986894  \n",
       "4  A survey of software testing practices in alberta        236  5.960924  "
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "query = \"To confuse the AI and DNN embedding, let's put random terms from other sentences- automation training test memory?\"\n",
    "table.search(\n",
    "    query,\n",
    "    query_type=\"fts\",\n",
    ").limit(5).to_pandas()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "hSj3jHwi5ChE"
   },
   "source": [
    "Now let's perform a vector search"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 206
    },
    "id": "jYTQMqqBA-DV",
    "outputId": "c2ab264f-0a77-456a-a631-83157010c24f"
   },
   "outputs": [
    {
     "data": {
      "application/vnd.google.colaboratory.intrinsic+json": {
       "summary": "{\n  \"name\": \")\",\n  \"rows\": 5,\n  \"fields\": [\n    {\n      \"column\": \"vector\",\n      \"properties\": {\n        \"dtype\": \"object\",\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"text\",\n      \"properties\": {\n        \"dtype\": \"string\",\n        \"num_unique_values\": 5,\n        \"samples\": [\n          \"a r t i c l e i n f o a b s t r a c t We present an automatic approach to the construction of BabelNet, a very large, wide-coverage multilingual semantic network. Key to our approach is the integration of lexicographic and encyclopedic knowledge from WordNet and Wikipedia. In addition, Machine Translation is applied to enrich the resource with lexical information for all languages. We first conduct in vitro experiments on new and existing gold-standard datasets to show the high quality and coverage of BabelNet. We then show that our lexical resource can be used successfully to perform both monolingual and cross-lingual Word Sense Disambiguation: thanks to its wide lexical coverage and novel semantic relations, we are able to achieve state-of the-art results on three different SemEval evaluation tasks.\",\n          \"The presented work aims at generating a systematically annotated corpus that can support the enhancement of sentiment analysis tasks in Telugu using wordlevel sentiment annotations. From OntoSenseNet, we extracted 11,000 adjectives, 253 adverbs, 8483 verbs and sentiment annotation is being done by language experts. We discuss the methodology followed for the polarity annotations and validate the developed resource. This work aims at developing a benchmark corpus, as an extension to SentiWordNet, and baseline accuracy for a model where lexeme annotations are applied for sentiment predictions. 
The fundamental aim of this paper is to validate and study the possibility of utilizing machine learning algorithms, word-level sentiment annotations in the task of automated sentiment identification. Furthermore, accuracy is improved by annotating the bi-grams extracted from the target corpus.\",\n          \"We introduce a technique for augmenting neural text-to-speech (TTS) with lowdimensional trainable speaker embeddings to generate different voices from a single model. As a starting point, we show improvements over the two state-ofthe-art approaches for single-speaker neural TTS: Deep Voice 1 and Tacotron. We introduce Deep Voice 2, which is based on a similar pipeline with Deep Voice 1, but constructed with higher performance building blocks and demonstrates a significant audio quality improvement over Deep Voice 1. We improve Tacotron by introducing a post-processing neural vocoder, and demonstrate a significant audio quality improvement. We then demonstrate our technique for multi-speaker speech synthesis for both Deep Voice 2 and Tacotron on two multi-speaker TTS datasets. 
We show that a single neural TTS system can learn hundreds of unique voices from less than half an hour of data per speaker, while achieving high audio quality synthesis and preserving the speaker identities almost perfectly.\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"title\",\n      \"properties\": {\n        \"dtype\": \"string\",\n        \"num_unique_values\": 5,\n        \"samples\": [\n          \"BabelNet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network\",\n          \"BCSAT : A Benchmark Corpus for Sentiment Analysis in Telugu Using Word-level Annotations\",\n          \"Deep Voice 2 : Multi-Speaker Neural Text-to-Speech\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"num_words\",\n      \"properties\": {\n        \"dtype\": \"number\",\n        \"std\": 25,\n        \"min\": 126,\n        \"max\": 187,\n        \"num_unique_values\": 5,\n        \"samples\": [\n          133,\n          126,\n          151\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"_distance\",\n      \"properties\": {\n        \"dtype\": \"float32\",\n        \"num_unique_values\": 5,\n        \"samples\": [\n          0.4026884436607361,\n          0.43003538250923157,\n          0.4178524613380432\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    }\n  ]\n}",
       "type": "dataframe"
      },
      "text/html": [
       "\n",
       "  <div id=\"df-610fb464-e5b8-409a-b365-05be94f9e89b\" class=\"colab-df-container\">\n",
       "    <div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>vector</th>\n",
       "      <th>text</th>\n",
       "      <th>title</th>\n",
       "      <th>num_words</th>\n",
       "      <th>_distance</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>[-0.029010361, -0.0019185242, 0.010176375, -0....</td>\n",
       "      <td>Classifying short texts to one category or clu...</td>\n",
       "      <td>Using deep learning for short text understanding</td>\n",
       "      <td>187</td>\n",
       "      <td>0.354978</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>[-0.013530017, -0.0019778903, 0.009588092, -0....</td>\n",
       "      <td>a r t i c l e i n f o a b s t r a c t We prese...</td>\n",
       "      <td>BabelNet: The automatic construction, evaluati...</td>\n",
       "      <td>133</td>\n",
       "      <td>0.402688</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>[-0.02732811, -0.00016068136, 0.01769485, -0.0...</td>\n",
       "      <td>We introduce a technique for augmenting neural...</td>\n",
       "      <td>Deep Voice 2 : Multi-Speaker Neural Text-to-Sp...</td>\n",
       "      <td>151</td>\n",
       "      <td>0.417852</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>[-0.03583555, -0.00017553041, 0.009382936, -0....</td>\n",
       "      <td>We introduce a long short-term memory recurren...</td>\n",
       "      <td>Speech-driven 3 D Facial Animation with Implic...</td>\n",
       "      <td>171</td>\n",
       "      <td>0.421976</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>[-0.010365823, -0.0064949114, 0.018580379, -0....</td>\n",
       "      <td>The presented work aims at generating a system...</td>\n",
       "      <td>BCSAT : A Benchmark Corpus for Sentiment Analy...</td>\n",
       "      <td>126</td>\n",
       "      <td>0.430035</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>\n",
       "  </div>\n"
      ],
      "text/plain": [
       "                                              vector  \\\n",
       "0  [-0.029010361, -0.0019185242, 0.010176375, -0....   \n",
       "1  [-0.013530017, -0.0019778903, 0.009588092, -0....   \n",
       "2  [-0.02732811, -0.00016068136, 0.01769485, -0.0...   \n",
       "3  [-0.03583555, -0.00017553041, 0.009382936, -0....   \n",
       "4  [-0.010365823, -0.0064949114, 0.018580379, -0....   \n",
       "\n",
       "                                                text  \\\n",
       "0  Classifying short texts to one category or clu...   \n",
       "1  a r t i c l e i n f o a b s t r a c t We prese...   \n",
       "2  We introduce a technique for augmenting neural...   \n",
       "3  We introduce a long short-term memory recurren...   \n",
       "4  The presented work aims at generating a system...   \n",
       "\n",
       "                                               title  num_words  _distance  \n",
       "0   Using deep learning for short text understanding        187   0.354978  \n",
       "1  BabelNet: The automatic construction, evaluati...        133   0.402688  \n",
       "2  Deep Voice 2 : Multi-Speaker Neural Text-to-Sp...        151   0.417852  \n",
       "3  Speech-driven 3 D Facial Animation with Implic...        171   0.421976  \n",
       "4  BCSAT : A Benchmark Corpus for Sentiment Analy...        126   0.430035  "
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "table.search(\n",
    "    query,\n",
    "    query_type=\"vector\",\n",
    "    vector_column_name=\"vector\",\n",
    ").limit(5).to_pandas()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "ZW-pg4Pt5K05"
   },
   "source": [
    "Now let's run a hybrid search, which combines the full-text and vector search results and reranks them with Reciprocal Rank Fusion (RRF)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 206
    },
    "id": "mLNei8jkjrfI",
    "outputId": "3f0dcf69-4657-471f-cabb-a383495d7615"
   },
   "outputs": [
    {
     "data": {
      "application/vnd.google.colaboratory.intrinsic+json": {
       "summary": "{\n  \"name\": \")\",\n  \"rows\": 5,\n  \"fields\": [\n    {\n      \"column\": \"vector\",\n      \"properties\": {\n        \"dtype\": \"object\",\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"text\",\n      \"properties\": {\n        \"dtype\": \"string\",\n        \"num_unique_values\": 5,\n        \"samples\": [\n          \"Classifying short texts to one category or clustering semantically related texts is challenging, and the importance of both is growing due to the rise of microblogging platforms, digital news feeds, and the like. We can accomplish this classifying and clustering with the help of a deep neural network which produces compact binary representations of a short text, and can assign the same category to texts that have similar binary representations. But problems arise when there is little contextual information on the short texts, which makes it difficult for the deep neural network to produce similar binary codes for semantically related texts. We propose to address this issue using semantic enrichment. This is accomplished by taking the nouns, and verbs used in the short texts and generating the concepts and co-occurring words with the help of those terms. The nouns are used to generate concepts within the given short text, whereas the verbs are used to prune the ambiguous context (if any) present in the text. The enriched text then goes through a deep neural network to produce a prediction label for that short text representing it\\u2019s category.\",\n          \"Hierarchical Pitman-Yor Process priors are compelling methods for learning language models, outperforming point-estimate based methods. However, these models remain unpopular due to computational and statistical inference issues, such as memory and time usage, as well as poor mixing of sampler. 
In this work we propose a novel framework which represents the HPYP model compactly using compressed suffix trees. Then, we develop an efficient approximate inference scheme in this framework that has a much lower memory footprint compared to full HPYP and is fast in the inference time. The experimental results illustrate that our model can be built on significantly larger datasets compared to previous HPYP models, while being several orders of magnitudes smaller, fast for training and inference, and outperforming the perplexity of the state-of-the-art Modified Kneser-Ney countbased LM smoothing by up to 15%.\",\n          \"Sociological and technical difficulties, such as a lack of informal encounters, can make it difficult for new members of noncollocated software development teams to learn from their more experienced colleagues. To address this situation, we have developed a tool, named Hipikat that provides developers with efficient and effective access to the group memory for a software development project that is implicitly formed by all of the artifacts produced during the development. This project memory is built automatically with little or no change to existing work practices. After describing the Hipikat tool, we present two studies investigating Hipikat's usefulness in software modification tasks. One study evaluated the usefulness of Hipikat's recommendations on a sample of 20 modification tasks performed on the Eclipse Java IDE during the development of release 2.1 of the Eclipse software. We describe the study, present quantitative measures of Hipikat's performance, and describe in detail three cases that illustrate a range of issues that we have identified in the results. In the other study, we evaluated whether software developers who are new to a project can benefit from the artifacts that Hipikat recommends from the project memory. 
We describe the study, present qualitative observations, and suggest implications of using project memory as a learning aid for project newcomers.\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"title\",\n      \"properties\": {\n        \"dtype\": \"string\",\n        \"num_unique_values\": 5,\n        \"samples\": [\n          \"Using deep learning for short text understanding\",\n          \"Compressed Nonparametric Language Modelling\",\n          \"Hipikat: a project memory for software development\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"num_words\",\n      \"properties\": {\n        \"dtype\": \"number\",\n        \"std\": 33,\n        \"min\": 133,\n        \"max\": 210,\n        \"num_unique_values\": 5,\n        \"samples\": [\n          187,\n          134,\n          210\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"_relevance_score\",\n      \"properties\": {\n        \"dtype\": \"float32\",\n        \"num_unique_values\": 3,\n        \"samples\": [\n          0.0314980149269104,\n          0.016393441706895828,\n          0.016129031777381897\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    }\n  ]\n}",
       "type": "dataframe"
      },
      "text/html": [
       "\n",
       "  <div id=\"df-8d0ec91a-76f3-4155-8beb-a7f0f24d221f\" class=\"colab-df-container\">\n",
       "    <div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>vector</th>\n",
       "      <th>text</th>\n",
       "      <th>title</th>\n",
       "      <th>num_words</th>\n",
       "      <th>_relevance_score</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>[-0.03583555, -0.00017553041, 0.009382936, -0....</td>\n",
       "      <td>We introduce a long short-term memory recurren...</td>\n",
       "      <td>Speech-driven 3 D Facial Animation with Implic...</td>\n",
       "      <td>171</td>\n",
       "      <td>0.031498</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>[-0.029010361, -0.0019185242, 0.010176375, -0....</td>\n",
       "      <td>Classifying short texts to one category or clu...</td>\n",
       "      <td>Using deep learning for short text understanding</td>\n",
       "      <td>187</td>\n",
       "      <td>0.016393</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>[-0.012242989, -0.003041715, 0.022755764, -0.0...</td>\n",
       "      <td>Sociological and technical difficulties, such ...</td>\n",
       "      <td>Hipikat: a project memory for software develop...</td>\n",
       "      <td>210</td>\n",
       "      <td>0.016393</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>[-0.013530017, -0.0019778903, 0.009588092, -0....</td>\n",
       "      <td>a r t i c l e i n f o a b s t r a c t We prese...</td>\n",
       "      <td>BabelNet: The automatic construction, evaluati...</td>\n",
       "      <td>133</td>\n",
       "      <td>0.016129</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>[-0.013615985, 0.024568425, 0.0012754148, -0.0...</td>\n",
       "      <td>Hierarchical Pitman-Yor Process priors are com...</td>\n",
       "      <td>Compressed Nonparametric Language Modelling</td>\n",
       "      <td>134</td>\n",
       "      <td>0.016129</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>\n",
       "  </div>\n"
      ],
      "text/plain": [
       "                                              vector  \\\n",
       "0  [-0.03583555, -0.00017553041, 0.009382936, -0....   \n",
       "1  [-0.029010361, -0.0019185242, 0.010176375, -0....   \n",
       "2  [-0.012242989, -0.003041715, 0.022755764, -0.0...   \n",
       "3  [-0.013530017, -0.0019778903, 0.009588092, -0....   \n",
       "4  [-0.013615985, 0.024568425, 0.0012754148, -0.0...   \n",
       "\n",
       "                                                text  \\\n",
       "0  We introduce a long short-term memory recurren...   \n",
       "1  Classifying short texts to one category or clu...   \n",
       "2  Sociological and technical difficulties, such ...   \n",
       "3  a r t i c l e i n f o a b s t r a c t We prese...   \n",
       "4  Hierarchical Pitman-Yor Process priors are com...   \n",
       "\n",
       "                                               title  num_words  \\\n",
       "0  Speech-driven 3 D Facial Animation with Implic...        171   \n",
       "1   Using deep learning for short text understanding        187   \n",
       "2  Hipikat: a project memory for software develop...        210   \n",
       "3  BabelNet: The automatic construction, evaluati...        133   \n",
       "4        Compressed Nonparametric Language Modelling        134   \n",
       "\n",
       "   _relevance_score  \n",
       "0          0.031498  \n",
       "1          0.016393  \n",
       "2          0.016393  \n",
       "3          0.016129  \n",
       "4          0.016129  "
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from lancedb.rerankers import RRFReranker\n",
    "\n",
    "reranker = RRFReranker()\n",
    "\n",
    "table.search(\n",
    "    query,\n",
    "    query_type=\"hybrid\",\n",
    "    vector_column_name=\"vector\",\n",
    ").rerank(\n",
    "    reranker\n",
    ").limit(5).to_pandas()"
   ]
  },
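  {
   "cell_type": "markdown",
   "metadata": {
    "id": "ydUfLfSw5VTJ"
   },
   "source": [
    "Next, let's define a custom reranker that applies RRF to the hybrid search results and then filters them: it drops documents containing an excluded term and keeps only those longer than 150 words."
   ]
  },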
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 206
    },
    "id": "BJ9sCJDilOoO",
    "outputId": "dce7b8e8-e5d1-48ec-a809-0ac17a3fa264"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "\n",
       "  <div id=\"df-8d5a7c04-c5f5-4f0d-af27-601ba6dca64e\" class=\"colab-df-container\">\n",
       "    <div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>vector</th>\n",
       "      <th>text</th>\n",
       "      <th>title</th>\n",
       "      <th>num_words</th>\n",
       "      <th>_relevance_score</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>[-0.03583555, -0.00017553041, 0.009382936, -0....</td>\n",
       "      <td>We introduce a long short-term memory recurren...</td>\n",
       "      <td>Speech-driven 3 D Facial Animation with Implic...</td>\n",
       "      <td>171</td>\n",
       "      <td>0.031498</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>[-0.029010361, -0.0019185242, 0.010176375, -0....</td>\n",
       "      <td>Classifying short texts to one category or clu...</td>\n",
       "      <td>Using deep learning for short text understanding</td>\n",
       "      <td>187</td>\n",
       "      <td>0.016393</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>[-0.012242989, -0.003041715, 0.022755764, -0.0...</td>\n",
       "      <td>Sociological and technical difficulties, such ...</td>\n",
       "      <td>Hipikat: a project memory for software develop...</td>\n",
       "      <td>210</td>\n",
       "      <td>0.016393</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>[-0.02732811, -0.00016068136, 0.01769485, -0.0...</td>\n",
       "      <td>We introduce a technique for augmenting neural...</td>\n",
       "      <td>Deep Voice 2 : Multi-Speaker Neural Text-to-Sp...</td>\n",
       "      <td>151</td>\n",
       "      <td>0.015873</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>[0.0013170651, -0.022660466, 0.014392637, -0.0...</td>\n",
       "      <td>Software organizations have typically de-empha...</td>\n",
       "      <td>A survey of software testing practices in alberta</td>\n",
       "      <td>236</td>\n",
       "      <td>0.015385</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>\n",
       "  </div>\n"
      ],
      "text/plain": [
       "                                              vector  \\\n",
       "0  [-0.03583555, -0.00017553041, 0.009382936, -0....   \n",
       "1  [-0.029010361, -0.0019185242, 0.010176375, -0....   \n",
       "2  [-0.012242989, -0.003041715, 0.022755764, -0.0...   \n",
       "5  [-0.02732811, -0.00016068136, 0.01769485, -0.0...   \n",
       "8  [0.0013170651, -0.022660466, 0.014392637, -0.0...   \n",
       "\n",
       "                                                text  \\\n",
       "0  We introduce a long short-term memory recurren...   \n",
       "1  Classifying short texts to one category or clu...   \n",
       "2  Sociological and technical difficulties, such ...   \n",
       "5  We introduce a technique for augmenting neural...   \n",
       "8  Software organizations have typically de-empha...   \n",
       "\n",
       "                                               title  num_words  \\\n",
       "0  Speech-driven 3 D Facial Animation with Implic...        171   \n",
       "1   Using deep learning for short text understanding        187   \n",
       "2  Hipikat: a project memory for software develop...        210   \n",
       "5  Deep Voice 2 : Multi-Speaker Neural Text-to-Sp...        151   \n",
       "8  A survey of software testing practices in alberta        236   \n",
       "\n",
       "   _relevance_score  \n",
       "0          0.031498  \n",
       "1          0.016393  \n",
       "2          0.016393  \n",
       "5          0.015873  \n",
       "8          0.015385  "
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from typing import List, Union\n",
    "\n",
    "import pyarrow as pa\n",
    "from lancedb.rerankers import RRFReranker\n",
    "\n",
    "\n",
    "class ModifiedRRF(RRFReranker):\n",
    "    # RRF reranker that also filters the fused results with custom rules.\n",
    "    def __init__(self, filters: Union[str, List[str]], **kwargs):\n",
    "        super().__init__(**kwargs)\n",
    "        # Accept either a single excluded term or a list of terms.\n",
    "        self.filters = filters if isinstance(filters, list) else [filters]\n",
    "\n",
    "    def rerank_hybrid(\n",
    "        self, query: str, vector_results: pa.Table, fts_results: pa.Table\n",
    "    ) -> pa.Table:\n",
    "        # First fuse the vector and full-text rankings with standard RRF.\n",
    "        combined_result = super().rerank_hybrid(query, vector_results, fts_results)\n",
    "        df = combined_result.to_pandas()\n",
    "        # Custom filters go here: drop rows whose text contains an excluded\n",
    "        # term and keep only documents longer than 150 words. The rules can\n",
    "        # be hard-coded as below or passed in dynamically.\n",
    "        for term in self.filters:\n",
    "            df = df.query(\"(not text.str.contains(@term)) & (num_words > 150)\")\n",
    "\n",
    "        return pa.Table.from_pandas(df)\n",
    "\n",
    "\n",
    "modified_reranker = ModifiedRRF(filters=[\"dual-band\"])\n",
    "\n",
    "table.search(\n",
    "    query,\n",
    "    query_type=\"hybrid\",\n",
    "    vector_column_name=\"vector\",\n",
    ").rerank(\n",
    "    reranker=modified_reranker\n",
    ").limit(5).to_pandas()"
   ]
  }
 ],
 "metadata": {
  "accelerator": "GPU",
  "colab": {
   "gpuType": "T4",
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
